
Vace finetuning #3

Open
Tatiana21 wants to merge 54 commits into huvunvidia:main from NeverMore960114:vace_ft

Conversation

@Tatiana21

Code for:

  1. Creating segmentation masks for inpainting tasks on a video dataset, based on the open-sora-plan dataset format
  2. Preprocessing and creating an energon dataset for finetuning with T2V, I2V, and V2V tasks
  3. Finetuning VACE for T2V, I2V, and V2V tasks

Refer to annotators/Inpainting/run_batch_process.sh for segmentation.
Refer to example_commands.sh for commands to process datasets and launch training.
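For context on what the inpainting preprocessing produces: V2V inpainting conditioning typically pairs each clip with a binary mask and blanks out the masked region so the model learns to regenerate it. A minimal sketch of that masking step (the function name and tensor layout are illustrative assumptions, not this PR's actual API):

```python
import torch

def make_inpaint_conditioning(frames: torch.Tensor, mask: torch.Tensor):
    """Zero out masked regions of a clip and keep the mask as conditioning.

    frames: (T, C, H, W) video clip; mask: (T, 1, H, W) with 1 = region to repaint.
    Returns the visible-context frames plus the mask, VACE-inpainting style.
    (Illustrative only -- not the layout used by this PR's preprocessing.)
    """
    masked_frames = frames * (1.0 - mask)  # mask broadcasts over the channel dim
    return masked_frames, mask
```

The actual scripts referenced above additionally generate the segmentation masks themselves and pack the result into energon shards.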

abhinavg4 and others added 30 commits September 30, 2025 14:23
* fix cpu init during export

Signed-off-by: yaoyu-33 <[email protected]>

* export env fix

Signed-off-by: yaoyu-33 <[email protected]>

* delete_extra_state for TE related during checkpoint loading for export

Signed-off-by: yaoyu-33 <[email protected]>

* paths fixes

Signed-off-by: yaoyu-33 <[email protected]>

* add override_provider option for checkpoint loading

Signed-off-by: yaoyu-33 <[email protected]>

* add unit test for override_provider option

Signed-off-by: yaoyu-33 <[email protected]>

* remove debug lines

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
* chore: Add issue template for model requests

Signed-off-by: oliver könig <[email protected]>

* copying over remaining templates

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* ci: Skip if `docs-only` label is attached

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* update

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* cleanup process group at end of performance script

Signed-off-by: Ananth Subramaniam <[email protected]>

* Update scripts/performance/run_script.py

Signed-off-by: Ananth Subramaniam <[email protected]>

* destroy pg for other scripts

Signed-off-by: Ananth Subramaniam <[email protected]>

* update

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
* ci(fix): pre-flight

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* test

Signed-off-by: oliver könig <[email protected]>

* final

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
* initial gemma commit

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma provider

Signed-off-by: Ananth Subramaniam <[email protected]>

* patch tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* add gemma bridge + tests

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conftest

Signed-off-by: Ananth Subramaniam <[email protected]>

* reenable msc

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix gemma test fallback

Signed-off-by: Ananth Subramaniam <[email protected]>

* try simpler tokenizer

Signed-off-by: Ananth Subramaniam <[email protected]>

* upload assets

Signed-off-by: Ananth Subramaniam <[email protected]>

* use pre-downloaded config for model provider test

Signed-off-by: Ananth Subramaniam <[email protected]>

* lint

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback -s

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* rebase

Signed-off-by: Ananth Subramaniam <[email protected]>

* use mcore activations

Signed-off-by: Ananth Subramaniam <[email protected]>

* update test

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix mock

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix conversion script reference

Signed-off-by: Ananth Subramaniam <[email protected]>

* subclass

Signed-off-by: Ananth Subramaniam <[email protected]>

* update tests

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* [docs] packed sequences

Signed-off-by: Ananth Subramaniam <[email protected]>

* address feedback

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* gemma2 provider and bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

* gemma2 model provider + bridge

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* [docs] placeholder page for performance summary

Signed-off-by: Ananth Subramaniam <[email protected]>

* add sections for releases

Signed-off-by: Ananth Subramaniam <[email protected]>

* improve description

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
… compatibility (NVIDIA-NeMo#829)

* save latest_checkpointed_iteration for compatibility

Signed-off-by: Ananth Subramaniam <[email protected]>

* fix megatron fsdp test assertion

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* exit profiler context

Signed-off-by: Ananth Subramaniam <[email protected]>

* disable vocab size logging in flops calculation

Signed-off-by: Ananth Subramaniam <[email protected]>

---------

Signed-off-by: Ananth Subramaniam <[email protected]>
* Clear disk space before install check

Signed-off-by: Charlie Truong <[email protected]>

* Revert "Clear disk space before install check"

This reverts commit 2c085f5.

Signed-off-by: Charlie Truong <[email protected]>

* Run bare metal install on self-hosted runners

Signed-off-by: Charlie Truong <[email protected]>

---------

Signed-off-by: Charlie Truong <[email protected]>
…A-NeMo#607)

* update llama and qwen models to use auto bridge and update recipes test as well

Signed-off-by: yaoyu-33 <[email protected]>

* temporary remove llama4 as it's not fully tested or verified.

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temporary remove llama4 as it's not fully tested or verified."

This reverts commit 5217084.

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* temp save

Signed-off-by: yaoyu-33 <[email protected]>

* Revert "temp save"

This reverts commit 0c57e2b.

* Revert "temp save"

This reverts commit 0748d52.

* update qwen's recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* remove some old recipe files

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe files to match old recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe file

Signed-off-by: yaoyu-33 <[email protected]>

* update qwen recipes

Signed-off-by: yaoyu-33 <[email protected]>

* update llama recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/qwen/qwen3.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* Update src/megatron/bridge/recipes/llama/llama2.py

Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Yu Yao <[email protected]>

* recipe naming update

Signed-off-by: yaoyu-33 <[email protected]>

* update test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add TypedDict for args

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* update docstring

Signed-off-by: yaoyu-33 <[email protected]>

* unit test fix and license fix

Signed-off-by: yaoyu-33 <[email protected]>

* sync eval_interval and save_interval

Signed-off-by: yaoyu-33 <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* set TRANSFORMERS_OFFLINE=1 in action.yml

Signed-off-by: yaoyu-33 <[email protected]>

* fix llama3 8b hf model path

Signed-off-by: yaoyu-33 <[email protected]>

* replay lr decay iters update on updated recipes

Signed-off-by: yaoyu-33 <[email protected]>

* Update action.yml

Signed-off-by: Yu Yao <[email protected]>

* add comments

Signed-off-by: yaoyu-33 <[email protected]>

* Add guard / mock for the places needs to download hf config in unit test

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

* add qwen functional test

Signed-off-by: yaoyu-33 <[email protected]>

* update recipe tests

Signed-off-by: yaoyu-33 <[email protected]>

* lint

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Signed-off-by: Yu Yao <[email protected]>
Co-authored-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
Signed-off-by: Ananth Subramaniam <[email protected]>
…ation support

- Introduced `pretrain_DiT_Model.py` for flexible pretraining using Megatron-Bridge.
- Updated `DITForwardStep` class to use `__call__` method for forward steps.
- Modified dataset configuration in `pretrain_config` to utilize `DiffusionDataModule`.
- Adjusted tensor and context parallelism settings in `llama3_8b.py`.

This commit enhances the pretraining capabilities and configuration flexibility for Llama3 models.
abhinavg4 and others added 24 commits October 6, 2025 09:33
- Commented out sections in `pretrain_DiT_Model.py` related to OmegaConf merging and command-line overrides for clarity.
- Added `backend` configuration in `llama3_8b_pretrain_override_example.yaml`.
- Updated `init_global_step` handling in `EnergonMultiModalDataModule` to simplify initialization.
- Introduced `DiffusionDataModuleConfig` for better dataset configuration management.
- Adjusted model parameters in `llama_provider.py` to set `num_layers` to 2 and added `seq_length` and `vocab_size` attributes in `DiTModelProvider`.
- Refined imports across various modules to ensure consistency and clarity.

This commit enhances the configuration structure and model initialization process, improving maintainability and usability.
Copilot AI review requested due to automatic review settings December 10, 2025 23:56

Copilot AI left a comment


Pull request overview

This pull request introduces comprehensive support for VACE finetuning, including tools for dataset preparation, preprocessing pipelines, and training infrastructure for T2V (Text-to-Video), I2V (Image-to-Video), and V2V (Video-to-Video) tasks. The implementation extends the existing WAN model architecture with VACE-specific layers and flow-matching training pipelines.

Key Changes

  • Added VACE model architecture with context and base layers for video editing tasks
  • Implemented flow matching training pipeline with configurable timestep sampling strategies
  • Created preprocessing utilities for video/image/mask data and segmentation mask generation
  • Added Gemma model family support (Gemma 1.0 and Gemma 2.0) with proper embedding scaling
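The "proper embedding scaling" in the Gemma bullet refers to Gemma's convention of multiplying token embeddings by sqrt(hidden_size) before the first transformer block. A hedged sketch of that detail (the class name is hypothetical, not the bridge's actual provider code):

```python
import math
import torch
import torch.nn as nn

class ScaledEmbedding(nn.Module):
    """Token embedding with Gemma-style sqrt(hidden_size) scaling (illustrative)."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.scale = math.sqrt(hidden_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Gemma scales embeddings up before they enter the first block;
        # omitting this is a common source of conversion mismatches.
        return self.embed(token_ids) * self.scale
```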

Reviewed changes

Copilot reviewed 108 out of 229 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/megatron/bridge/models/wan/wan_layer_spec.py Defines WAN transformer layer specifications including VACE-specific base and context layers with adaptive layer normalization
src/megatron/bridge/models/wan/wan_bridge.py Implements parameter mapping bridges between HuggingFace and Megatron formats for WAN and VACE models
src/megatron/bridge/models/wan/utils/utils.py Provides utility functions for grid size calculation, patching/unpatching, and context parallelism operations
src/megatron/bridge/models/wan/utils/preprocessor.py Implements video and image preprocessing classes with resizing, cropping, and normalization capabilities
src/megatron/bridge/models/wan/rope_utils.py Implements 3D RoPE (Rotary Position Embeddings) for spatial-temporal attention in video models
src/megatron/bridge/models/wan/modules/vae.py Defines VAE encoder/decoder architecture with causal 3D convolutions for video latent encoding
src/megatron/bridge/models/wan/modules/tokenizers.py Provides HuggingFace tokenizer wrapper with text cleaning utilities
src/megatron/bridge/models/wan/modules/t5.py Implements T5 encoder/decoder models with custom layer normalization and attention mechanisms
src/megatron/bridge/models/wan/flow_matching/time_shift_utils.py Implements timestep sampling strategies and sigma computation for flow matching training
src/megatron/bridge/models/wan/flow_matching/flow_pipeline.py Defines training pipeline for flow matching with support for both WAN and VACE models
src/megatron/bridge/models/wan/flow_matching/flow_inference_pipeline.py Implements inference pipeline with DPM/UniPC solvers and pipeline parallelism support
src/megatron/bridge/models/wan/inference/configs/*.py Configuration files for different WAN model variants (T2V, I2V, VACE) with size-specific settings
src/megatron/bridge/models/gemma/*.py Adds complete Gemma model family support with proper embedding scaling and configuration mappings


Comment on lines +232 to +233
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors

Copilot AI Dec 10, 2025


Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors

Comment on lines +359 to +360
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors

Copilot AI Dec 10, 2025


Corrected spelling of 'becuase' to 'because' in both comments.

Suggested change
query = query.contiguous() # important becuase TE attention expects contiguous tensors
key = key.contiguous() # important becuase TE attention expects contiguous tensors
query = query.contiguous() # important because TE attention expects contiguous tensors
key = key.contiguous() # important because TE attention expects contiguous tensors


class CausalConv3d(nn.Conv3d):
"""
Causal 3d convolusion.

Copilot AI Dec 10, 2025


Corrected spelling of 'convolusion' to 'convolution' in docstring.

Suggested change
Causal 3d convolusion.
Causal 3d convolution.

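For readers unfamiliar with the class under review: a causal 3D convolution pads the temporal axis only with past frames, so the output at frame t never depends on future frames, while the spatial axes are padded symmetrically as usual. A hedged, self-contained sketch of the idea (not the repo's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3dSketch(nn.Conv3d):
    """3D convolution that is causal in time (illustrative, not the repo's class).

    Spatial dims get symmetric zero padding; the temporal dim is padded only
    in front (the past), so frame t depends solely on frames <= t.
    """

    def __init__(self, in_ch: int, out_ch: int, kernel_size, **kwargs):
        super().__init__(in_ch, out_ch, kernel_size, **kwargs)
        kt, kh, kw = self.kernel_size
        # F.pad order for 5D input: (w_left, w_right, h_top, h_bottom, t_front, t_back)
        self._causal_pad = (kw // 2, kw // 2, kh // 2, kh // 2, kt - 1, 0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return super().forward(F.pad(x, self._causal_pad))
```

This property is what lets a video VAE encode frames streamingly without peeking at future content.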
Comment on lines +1014 to +1018
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):

Copilot AI Dec 10, 2025


Parameter names should follow snake_case convention. These should be input_frames, input_masks, and input_ref_images instead of capitalized versions.

Suggested change
Input_frames (`list[Tensor]`):
Input frames for content generation
Input_masks (`list[Tensor]`):
Input masks for content generation
Input_ref_images (`list[Tensor]`):
input_frames (`list[Tensor]`):
Input frames for content generation
input_masks (`list[Tensor]`):
Input masks for content generation
input_ref_images (`list[Tensor]`):

"""Configuration for a 2B parameter Code Gemma model.

Extends GemmaModelProvider with specific settings for code generation.
Thism model has an identical configuration to GemmaModelProvider2B.

Copilot AI Dec 10, 2025


Corrected spelling of 'Thism' to 'This' in docstring.

Suggested change
Thism model has an identical configuration to GemmaModelProvider2B.
This model has an identical configuration to GemmaModelProvider2B.


8 participants